# You can comment your code by starting a line with `#`
# 1 + 1
# You have basic operators such as
1 + 1 / 2
[1] 1.5
[1] FALSE
[1] FALSE
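As a minimal sketch of the basic operators: the code that produced the two `FALSE` results above is not shown, so the comparisons below are illustrative assumptions rather than the original examples.

```r
# Arithmetic: division is evaluated before addition
a <- 1 + 1 / 2        # 1.5
b <- (1 + 1) / 2      # 1
int_div <- 5 %/% 2    # integer division: 2
remainder <- 5 %% 2   # modulo: 1
power <- 2^3          # exponentiation: 8

# Comparisons and logical operators return TRUE or FALSE
is_equal <- 1 == 2      # FALSE
is_greater <- 1 > 2     # FALSE
both <- TRUE & FALSE    # FALSE

print(c(a, b, int_div, remainder, power))
print(c(is_equal, is_greater, both))
```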
By the Svardal lab, based on material by Alexandros Bantounas, with contributions by Curro Campuzano.
R is a programming language for statistical computing and data visualization.
It has a very rich ecosystem of packages (i.e. collections of pre-written code) to perform statistical and genomic analysis.
Start an interactive session by running R in the command line.
Run a script non-interactively with Rscript script.R > output.txt in the command line.
If you can’t install R:
You can play with a limited, RStudio-like version of R (webR) that runs in your browser at https://webr.r-wasm.org/latest/
Or execute code directly on the cluster.
Has anyone had any problems?
Define/create a folder to be used as the working directory.
Open RStudio and create a new script file (via the menu). You can also create a project (button at the top right).
Set the working directory to your prepared folder.
Write your script in the script window and save it. Send the selected line(s) of code to the console using Ctrl+Return (PC).
Conduct analyses, save the script, outputs, and graphs. When the entire analysis is ready, you can compile code and output into a notebook.
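The folder and working-directory steps can also be done from code. This is a minimal sketch; the folder name `r_intro` and the file `script.R` are hypothetical, and a temporary location is used so nothing is left behind.

```r
# Hypothetical project folder inside R's temporary directory
work_dir <- file.path(tempdir(), "r_intro")
dir.create(work_dir, showWarnings = FALSE, recursive = TRUE)

# setwd() changes the working directory and invisibly returns the previous one
old_dir <- setwd(work_dir)
in_new_dir <- basename(getwd()) == "r_intro"

# Relative paths are now resolved inside the working directory
writeLines("1 + 1", "script.R")
print(list.files())

# Go back to where we started
setwd(old_dir)
```

In RStudio, creating a project achieves the same effect, because opening the project sets the working directory to the project folder.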
In R, almost everything is an atomic vector, a list, or a function.
[1] 1 2 3
[1] TRUE FALSE TRUE
[1] "A" "C" "T" "G"
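The three outputs above are what atomic vectors like the following would print. The original code is not shown, so treat this as an assumed reconstruction, extended with a list and a function check for comparison.

```r
# Atomic vectors hold values of a single type
numbers <- c(1, 2, 3)
flags <- c(TRUE, FALSE, TRUE)
bases <- c("A", "C", "T", "G")

print(typeof(numbers))  # "double"
print(typeof(flags))    # "logical"
print(typeof(bases))    # "character"

# Mixing types in c() coerces everything to the most general type
mixed <- c(1, "A")      # both elements become character

# A list can hold elements of different types and lengths
sample_info <- list(id = "N1A1", bases = bases, n = 3)

# Functions are objects too
print(is.function(length))  # TRUE
```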
Functions in R are reusable blocks of code that take inputs (arguments), perform a specific task, and return an output.
Making your own functions allows you to automate common tasks in a more powerful way than copy-pasting.
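As a small illustration of writing your own function, here is a sketch of a hypothetical helper, `gc_content()`, that computes the fraction of G and C bases in a character vector. The name and the task are assumptions chosen for this example.

```r
# Hypothetical helper: fraction of bases that are G or C
gc_content <- function(bases) {
  # %in% returns TRUE/FALSE per element; mean() of logicals gives the proportion
  mean(bases %in% c("G", "C"))
}

result <- gc_content(c("A", "C", "T", "G"))
print(result)  # 0.5
```

Once defined, the function can be reused on any vector of bases instead of copy-pasting the calculation.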
Loops allow us to iteratively apply a function on a list of inputs. The main loop used in this tutorial is the for loop.
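A minimal sketch of a for loop, using made-up sequences as input (they are assumptions for illustration only):

```r
# Illustrative input: three short sequences
sequences <- c("ACGT", "GGC", "ATATAT")

# Pre-allocate the result vector, then fill it one element at a time
seq_lengths <- numeric(length(sequences))
for (i in seq_along(sequences)) {
  seq_lengths[i] <- nchar(sequences[i])
}

print(seq_lengths)  # 4 3 6
```

Using `seq_along()` rather than `1:length(x)` keeps the loop from running when the input is empty.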
R packages are sets of custom functions and object classes that can be installed and used. Most R packages are deposited in the CRAN repository.
The R language has evolved quite a lot since it was created. A “modern” style of writing R code is promoted by the tidyverse, a collection of packages.
The syntax library(package_name) attaches the names exported by a package to your active session, so you can refer to them directly.
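The sketch below shows what attaching a package changes. It uses `tools`, a package that ships with R, as a stand-in so that nothing has to be downloaded; the `install.packages()` call is left commented out because it needs network access.

```r
# Installing from CRAN is done once per machine (not run here):
# install.packages("dplyr")

# Without attaching, a function can be called with the package::function form
ext <- tools::file_ext("script.R")
print(ext)  # "R"

# library() attaches the package, adding it to the search path
library(tools)
attached <- "package:tools" %in% search()

# After attaching, the function name alone is enough
ext_again <- file_ext("script.R")
```

The `package::function` form is also useful when two attached packages export a function with the same name.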
Often, you want to load data generated outside your R session (by others or a genomics pipeline). Tables are encoded as data frames, which are lists of equal-length vectors.
[1] 152 17
[1] "studyName" "Sample Number" "Species"
[4] "Region" "Island" "Stage"
[7] "Individual ID" "Clutch Completion" "Date Egg"
[10] "Culmen Length (mm)" "Culmen Depth (mm)" "Flipper Length (mm)"
[13] "Body Mass (g)" "Sex" "Delta 15 N (o/oo)"
[16] "Delta 13 C (o/oo)" "Comments"
Rows: 152
Columns: 17
$ studyName <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL…
$ `Sample Number` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ Species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P…
$ Region <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"…
$ Island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse…
$ Stage <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu…
$ `Individual ID` <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", …
$ `Clutch Completion` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", …
$ `Date Egg` <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16,…
$ `Culmen Length (mm)` <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34…
$ `Culmen Depth (mm)` <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18…
$ `Flipper Length (mm)` <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190,…
$ `Body Mass (g)` <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 34…
$ Sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE"…
$ `Delta 15 N (o/oo)` <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18…
$ `Delta 13 C (o/oo)` <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.298…
$ Comments <chr> "Not enough blood for isotopes.", NA, NA, "Adult…
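The outputs above look like the results of inspecting a loaded table. As a self-contained sketch, the code below writes three rows (values taken from the first rows shown above) to a temporary CSV file and reads them back with base R's `read.csv()`. The original slides may have used a different reader, so the function choice is an assumption.

```r
# Write a tiny CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Island,Body Mass (g),Sex",
    "Torgersen,3750,MALE",
    "Torgersen,3800,FEMALE",
    "Torgersen,3250,FEMALE"
  ),
  csv_path
)

# check.names = FALSE keeps column names such as "Body Mass (g)" unchanged
toy <- read.csv(csv_path, check.names = FALSE)

print(dim(toy))       # 3 3 (rows, columns)
print(colnames(toy))
str(toy)              # compact overview of column types

# A data frame is a list whose elements (columns) all have the same length
print(is.list(toy))   # TRUE
print(lengths(toy))
```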
x <- df[1, ] # Accessing the first row.
x <- df[, 1] # Accessing the first column.
df[, "studyName"] # Accessing the column by name
# A tibble: 152 × 1
studyName
<chr>
1 PAL0708
2 PAL0708
3 PAL0708
4 PAL0708
5 PAL0708
6 PAL0708
7 PAL0708
8 PAL0708
9 PAL0708
10 PAL0708
# ℹ 142 more rows
# A tibble: 1 × 1
Species
<chr>
1 Adelie Penguin (Pygoscelis adeliae)
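The indexing rules can be tried on a small base R data frame. This is a sketch with a three-row toy table (values copied from the first rows of the data shown earlier). Note that a plain `data.frame` drops a single column to a vector, whereas the tibble used above keeps it as a one-column table.

```r
toy <- data.frame(
  Island = c("Torgersen", "Torgersen", "Torgersen"),
  Sex = c("MALE", "FEMALE", "FEMALE"),
  stringsAsFactors = FALSE
)

first_row <- toy[1, ]                    # one row, all columns (a data frame)
first_col <- toy[, 1]                    # first column, dropped to a vector
by_name <- toy[, "Sex", drop = FALSE]    # keep the data frame structure
one_cell <- toy[1, "Island"]             # a single value
as_vector <- toy[["Sex"]]                # [[ ]] always extracts the column itself

print(first_row)
print(one_cell)
```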
For more advanced data manipulations, you can use functions from the dplyr package and chain operations by passing the output of one function as input to the next one using the %>% pipe operator.
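To see what a pipe does without any packages, the sketch below uses R's native pipe `|>` (available from R 4.1), which behaves like `%>%` for simple chains such as this one. Using the native pipe here is a substitution made for the example, not the operator used in the slides.

```r
bases <- c("A", "C", "T", "G", "G")

# Nested calls are read from the inside out
nested <- length(unique(bases))

# The pipe passes the left-hand side as the first argument of the next call,
# so the same steps are read from left to right
piped <- bases |> unique() |> length()

print(nested)  # 4
print(piped)   # 4
```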
dplyr manipulation
Could you guess what is happening exactly?
library(dplyr) # provides %>%, mutate(), filter(), select() and slice_sample()

df %>%
  mutate(Sex = tolower(Sex)) %>%
  filter(Sex == "female") %>%
  filter(Island %in% c("Torgersen", "Biscoe", "Dream")) %>%
  filter(!is.na(Stage)) %>%
  select("Island", starts_with("Culmen")) %>%
  slice_sample(n = 5)
# A tibble: 5 × 3
Island `Culmen Length (mm)` `Culmen Depth (mm)`
<chr> <dbl> <dbl>
1 Dream 38.1 18.6
2 Torgersen 39 17.1
3 Biscoe 39.6 17.7
4 Torgersen 35.9 16.6
5 Dream 36.8 18.5
dplyr manipulation
df2 <- df %>%
  # mutate() creates or modifies columns (here: Sex is converted to lower case)
  mutate(Sex = tolower(Sex)) %>%
  # filter() rows by column value
  filter(Sex == "female") %>%
  # filter() rows by a set of values
  filter(Island %in% c("Torgersen", "Biscoe", "Dream")) %>%
  # filter() out rows with missing values
  filter(!is.na(Stage)) %>%
  # select() certain columns by index, name or pattern
  select("Island", starts_with("Culmen")) %>%
  # Take a random sample of rows
  slice_sample(n = 5)
In base R, there are many convenient plots that just “work” when you attempt to plot different objects. However, for final plots, base R graphics are not always the most convenient option.
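As a sketch of base R plots that "just work", the calls below use the built-in `iris` dataset as a stand-in for the penguin data (an assumption made so the example runs without extra files). Each call picks a sensible plot type from the kind of object it receives.

```r
# A numeric vector: hist() draws a histogram and returns the binned counts
h <- hist(iris$Sepal.Length, main = "Sepal length", xlab = "Sepal length (cm)")

# A factor: plot() draws a bar chart of the counts per level
plot(iris$Species)

# A data frame of numeric columns: plot() draws a scatterplot matrix
plot(iris[, 1:4])

# A formula: boxplot() draws one box per group
boxplot(Sepal.Length ~ Species, data = iris)
```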
ggplot
library(ggplot2) # provides ggplot(), aes() and the geom_*() layers

# Set a theme
theme_set(ggthemes::theme_tufte())
df %>%
  # Discard individuals with unknown sex
  filter(!is.na(Sex)) %>%
  # Create a plot with body mass on the x-axis, filled by Sex
  ggplot(aes(x = `Body Mass (g)`, fill = Sex)) +
  # Plot a histogram
  geom_histogram(color = "black", alpha = 0.8) +
  ylab("Count") + # Y-axis label (a histogram shows counts on the y-axis)
  ggtitle("Histogram example") # Add title
ggplot
df %>%
  filter(!is.na(Sex)) %>%
  ggplot(aes(x = `Flipper Length (mm)`, y = `Body Mass (g)`, colour = Sex)) +
  geom_smooth(method = lm, linetype = "dashed") +
  geom_point(shape = 1) +
  facet_wrap(~Island) +
  ggtitle(
    label = "Flipper Length versus Body Mass",
    subtitle = "Adelie penguins differ between sexes, but not between islands"
  )